14 research outputs found

    Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators

    The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. To this end, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations such as interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% of FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
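
    To make the inter-tile reuse idea concrete, here is a minimal C sketch (not the paper's methodology or code) of a tiled K x K stencil in which consecutive tiles along a row share their (K-1)-column halo: the shared columns stay in the on-chip buffer and only the fresh columns are fetched. The dma_load helper and all sizes are illustrative assumptions; boundary tiles are omitted for brevity.

        #include <string.h>

        #define W   512     /* image width (illustrative)   */
        #define H   512     /* image height                 */
        #define K   3       /* stencil size, halo = K-1     */
        #define TW  64      /* tile width                   */
        #define TH  64      /* tile height                  */

        #define BW  (TW + K - 1)   /* buffered tile width   */
        #define BH  (TH + K - 1)   /* buffered tile height  */

        /* Stand-in for a DMA transfer from external memory. */
        static void dma_load(float *dst, int dst_stride,
                             const float *src, int src_stride,
                             int rows, int cols)
        {
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    dst[r * dst_stride + c] = src[r * src_stride + c];
        }

        void stencil_tiled(const float *in, float *out)
        {
            static float buf[BH][BW];   /* on-chip buffer */
            (void)out;

            for (int ty = 0; ty + BH <= H; ty += TH) {
                for (int tx = 0; tx + BW <= W; tx += TW) {
                    if (tx == 0) {
                        /* First tile in the row: load the full footprint. */
                        dma_load(&buf[0][0], BW, &in[ty * W], W, BH, BW);
                    } else {
                        /* Inter-tile reuse: keep the K-1 halo columns shared
                         * with the previous tile, fetch only TW new columns
                         * instead of the whole footprint.                   */
                        for (int r = 0; r < BH; r++)
                            memmove(&buf[r][0], &buf[r][TW],
                                    (K - 1) * sizeof(float));
                        dma_load(&buf[0][K - 1], BW,
                                 &in[ty * W + tx + K - 1], W, BH, TW);
                    }
                    /* ... compute the TH x TW output tile from buf ... */
                }
            }
        }

    Interchanging the tile loops changes which halo can stay resident; that ordering choice is exactly what the exploration in the paper covers.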

    A data-reuse aware accelerator for large-scale convolutional networks

    This paper presents a clustered SIMD accelerator template for Convolutional Networks. These networks significantly outperform other methods in detection and classification tasks in the vision domain. Due to their excessive compute and data transfer requirements, these applications benefit greatly from a dedicated accelerator. The proposed accelerator reduces memory traffic by loop transformations such as tiling and fusion, which merges successive layers. Although fusion can introduce redundant computations, it often reduces data transfer and can therefore remove performance bottlenecks. The SIMD cluster is mapped to a Xilinx Zynq FPGA, where it achieves 6.4 Gops with a small amount of resources. The performance can be scaled by using multiple clusters.
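
    The fusion trade-off mentioned above can be illustrated with a small C sketch (illustrative only, not the accelerator template itself) on two successive 1-D convolutions: the fused version recomputes layer-1 values where layer 2 needs them, trading redundant multiplies for never writing the intermediate feature map to memory.

        #define N 1024    /* input length (illustrative)   */
        #define K 3       /* kernel size of both layers    */

        /* Unfused: layer 1 writes a full intermediate array that
         * layer 2 reads back, costing ~2N words of traffic.     */
        void conv2_unfused(const float *in, float *out,
                           const float *w1, const float *w2)
        {
            static float mid[N];
            for (int i = 0; i + K <= N; i++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++) acc += w1[k] * in[i + k];
                mid[i] = acc;
            }
            for (int i = 0; i + 2 * (K - 1) < N; i++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++) acc += w2[k] * mid[i + k];
                out[i] = acc;
            }
        }

        /* Fused: the layers are merged into one loop nest; each
         * layer-1 value is recomputed on the fly (K x redundant
         * multiplies), but `mid` never exists in memory.        */
        void conv2_fused(const float *in, float *out,
                         const float *w1, const float *w2)
        {
            for (int i = 0; i + 2 * (K - 1) < N; i++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++) {
                    float m = 0.0f;       /* = mid[i + k] */
                    for (int j = 0; j < K; j++)
                        m += w1[j] * in[i + k + j];
                    acc += w2[k] * m;
                }
                out[i] = acc;
            }
        }

    In a real mapping a small sliding buffer would cache the recomputed values, but the traffic argument is the same.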

    VLIW Code Generation for a Convolutional Network Accelerator

    This paper presents a compiler flow to map Deep Convolutional Networks (ConvNets) to a highly specialized VLIW accelerator core targeting the low-power embedded market. Earlier works have focused on energy-efficient accelerators for this class of algorithms, but none of them provides a complete and practical programming model. Due to the large parameter set of a ConvNet, it is essential that the user can abstract from the accelerator architecture and does not have to rely on an error-prone and ad-hoc assembly programming model. By using modulo scheduling for software pipelining, we demonstrate that our automatically generated code achieves hardware utilization equal to, or within 5-20% of, that of code written manually by experts. Our compiler removes the huge manual workload of efficiently mapping ConvNets to an energy-efficient core for next-generation mobile and wearable devices.
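
    As a rough intuition for what modulo scheduling buys, here is a hand-rotated C toy (not the compiler's generated VLIW code): the three-stage body load -> multiply -> store is rotated so that in the steady-state kernel each loop iteration executes one stage of three consecutive original iterations, which a VLIW core can issue in parallel.

        void scale(const float *in, float *out, float c, int n)
        {
            if (n < 2) {                 /* too short to pipeline */
                for (int i = 0; i < n; i++) out[i] = c * in[i];
                return;
            }

            float x = in[0];             /* prologue: fill the pipeline */
            float y = c * x;
            x = in[1];

            for (int i = 2; i < n; i++) {    /* steady-state kernel      */
                out[i - 2] = y;              /* store,    iteration i-2  */
                y = c * x;                   /* multiply, iteration i-1  */
                x = in[i];                   /* load,     iteration i    */
            }

            out[n - 2] = y;              /* epilogue: drain the pipeline */
            out[n - 1] = c * x;
        }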

    Improving the efficiency of deep convolutional networks


    Speed sign detection and recognition by convolutional neural networks

    From the desire to update the maximum road speed data for navigation devices, a speed sign recognition and detection system is proposed. This system should prevent accidental speeding on roads where the map data is incorrect, for example due to construction work. Multiple examples of road sign classification systems already exist, but none uses a fully trainable solution. This feature enables the "vendor" to easily add new speed signs by training with a set of examples instead of designing a new system. To meet the above requirements, a fully trainable Convolutional Neural Network (CNN) is used for the detection and recognition of speed signs. The system is trained with a labelled set of examples of speed sign images. Training of the total classification system is done off-line with the error back-propagation algorithm. A trained system is used to collect new training data from road scene images to learn from previous errors; this technique is known as boosting. After the boosting step, 0.19% of the images in our online available test set are misclassified. For the detection application, the search window of the trained CNN is scaled to a 1280×720 HD image size to detect speed signs at multiple scales and positions in front of a vehicle. Because of the massive amount of parallelism in the computations of a CNN, the algorithm is mapped to a Graphics Processing Unit (GPU). The GPU implementation demonstrates the abilities of the recognition system on a low-cost consumer platform with a real-time frame rate of 35 fps.
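
    The multi-scale search described above follows the usual sliding-window pattern; below is a hedged C sketch of one pyramid level, where cnn_classify is a hypothetical stand-in for the trained network (stubbed so the snippet compiles). Every window is independent, which is the parallelism the GPU mapping exploits.

        #include <stdio.h>

        #define WIN    32   /* classifier input window, WIN x WIN */
        #define STRIDE 4    /* window step in pixels              */

        typedef struct { const float *pix; int w, h; } Image;

        /* Stub for the trained CNN: returns a class id or -1 for
         * "no sign". A real version evaluates the network on the
         * WIN x WIN window at (x, y).                            */
        static int cnn_classify(const Image *img, int x, int y)
        {
            (void)img; (void)x; (void)y;
            return -1;
        }

        /* Scan one pyramid level; `scale` maps window coordinates
         * back to the original 1280x720 frame. Repeating this over
         * progressively downscaled copies of the frame finds signs
         * of different apparent sizes.                             */
        void scan_level(const Image *img, float scale)
        {
            for (int y = 0; y + WIN <= img->h; y += STRIDE)
                for (int x = 0; x + WIN <= img->w; x += STRIDE) {
                    int cls = cnn_classify(img, x, y);
                    if (cls >= 0)
                        printf("sign %d at (%.0f, %.0f), size %.0f px\n",
                               cls, x * scale, y * scale, WIN * scale);
                }
        }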

    Optimal iteration scheduling for intra- and inter-tile reuse in nested loop accelerators

    High-Level Synthesis tools have reduced accelerator design time. However, a complex scaling problem that remains is the data transfer bottleneck. Accelerators require huge amounts of data and are often limited by interconnect resources. Local buffers can reduce communication by exploiting data reuse, but the data access order has a substantial impact on the amount of reuse that can be utilized. With loop transformations such as interchange and tiling, the data access order can be modified. However, for real applications the design space is huge, and finding the best set of transformations is often intractable. Therefore, we present a new methodology that minimizes data transfer by loop interchange and tiling. In contrast to other methods, we take inter-tile reuse and loop bounds into account. For real-world applications we show buffer size trade-offs that can give speedups of up to 14x; alternatively, these can reduce the required FPGA resources substantially.
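
    The buffer-size trade-off can be sketched with a back-of-envelope C model (illustrative numbers, not the paper's cost model) that compares the external words transferred for a K x K stencil when each tile loads its full input footprint (intra-tile reuse only) versus when the row-wise halo stays resident (inter-tile reuse):

        #include <stdio.h>

        #define W 1920   /* frame width (illustrative)  */
        #define H 1080   /* frame height                */
        #define K 5      /* stencil size                */

        int main(void)
        {
            /* A square T x T output tile has an input footprint
             * of (T+K-1) x (T+K-1) words.                        */
            for (long t = 16; t <= 256; t *= 2) {
                long nx = W / t, ny = H / t;   /* tiles per row/column */
                long fp = (t + K - 1) * (t + K - 1);
                /* Intra-tile reuse only: every tile loads its footprint. */
                long intra = nx * ny * fp;
                /* Inter-tile reuse: per tile row, the first tile loads the
                 * full footprint, the rest load only t fresh columns.     */
                long inter = ny * (fp + (nx - 1) * (t + K - 1) * t);
                printf("tile %3ldx%-3ld buffer %6ld intra %9ld inter %9ld\n",
                       t, t, fp, intra, inter);
            }
            return 0;
        }

    Larger tiles need a bigger buffer but shrink the relative halo overhead; choosing the iteration order and tile shape that minimize transfers under a buffer budget is the exploration the methodology automates.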